ABSTRACT


Parallel processing structures such as multiprocessor arrays and
pipelining enhance throughput tremendously for suitable algorithms
having high degrees of concurrency. However, if the time to process
different workpackets becomes irregular, much of the advantage over
traditional sequential processing systems may be lost.

In an attempt to produce a more flexible response to workload 
demands, a transputer workfarm was investigated. Two network 
topologies, a linear model and a tree model, were built using the 
transputer as the processing element (PE), or worker. An algorithm 
was developed which could be run independently on all workers in the 
workfarm. Each worker produced results independent of the other 
workers. By altering specific variables within the algorithm, the 
network performance could be changed. The results from this thesis 
illustrate how these parameters affect each network and provide 
comparative information between the linear model and the tree
model.
I. INTRODUCTION


A. BACKGROUND 

Distributed computing systems provide an exciting avenue for 
enhancing performance of hardware and software systems. Traditional 
multiprocessor systems utilize a one- or two-dimensional array of 
processors using nearest-neighbor or pipeline computing structures 
[Ref. 1]. Many networks will operate efficiently when each processor 
shares the computational load, i.e., achieves load-balancing.
Programmers and users have a unique environment in which
algorithms possessing a concurrent nature can be run with much
higher efficiencies than previously encountered in sequential
programming structures. Most solutions to existing problems have 
been approached from a sequential processing aspect. A significant 
percentage of these problems have varying degrees of underlying 
concurrency which may be exploited. The most dramatic advantage to 
be gained by exploiting the concurrent nature of the algorithms is the 
increased throughput. By dividing a task into several smaller tasks and 
processing them concurrently on separate processing elements (PEs), 
a tremendous decrease in processing time may be observed. 

Once a task has been determined to have some inherent 
concurrency, either a system must be built to conform to the nature of 
the problem, or the problem must be configured to run on existing 
parallel processing systems. Stone [Ref. 1] points out that there are
several different parallel processing structures and philosophies
available, each with its own distinct advantage. The purpose of this
thesis is to investigate one of these structures, called a workfarm. The
PEs within the workfarm may be configured into many topologies;
only two are examined in depth here, the linear topology and
the tree topology.


B. WORKFARM CONCEPT

Suppose a problem can be broken into a finite number of identical 
parts, each of which takes a different amount of time to solve. Due to 
varying processing times, nearest-neighbor or pipeline designs may
encounter difficulties caused by an unbalanced work load on adjacent
processors, resulting in communications delays.


A logical alternative is a processor workfarm. In a workfarm, each 
processor (worker) executes the same functional block of code on 
individual workpackets independently of adjacent workers. This 
design is inherently different from the nearest-neighbor arrays and 
pipelines in that adjacent workers operate asynchronously with 
respect to each other. A controlling processor distributes the
workpackets to the workfarm whenever a results packet is returned
from the network. This synchronization is handled within the
software at the controller level.
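
The benefit of handing out work on demand can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is written in Python rather than OCCAM, it ignores all communication costs, and the function names and packet durations are hypothetical rather than taken from the thesis. It compares a farm, in which each packet goes to whichever worker becomes free first, with a fixed round-robin assignment made in advance.

```python
import heapq

def farm_makespan(durations, n_workers):
    """Demand-driven farm: each packet goes to the worker that frees up first."""
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    finish = 0
    for d in durations:
        t, w = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, (t + d, w))
        finish = max(finish, t + d)
    return finish

def static_makespan(durations, n_workers):
    """Fixed round-robin assignment decided before any packet is processed."""
    load = [0] * n_workers
    for i, d in enumerate(durations):
        load[i % n_workers] += d
    return max(load)

# Hypothetical, irregular processing times for eight workpackets.
durations = [5, 1, 1, 1, 5, 1, 1, 1]
print(farm_makespan(durations, 2), static_makespan(durations, 2))
```

With these particular durations the fixed assignment places both long packets on the same worker, whereas the farm lets the idle worker pick up the slack. Other duration patterns may narrow or remove the gap.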


In the case of a linear model, Figure 1.1, a controller distributes
workpackets to the first worker in the workfarm. The first worker
will process as many packets as it is able to handle and send the
remaining workpackets on to the rest of the workfarm. This happens
for each successive worker until the end worker is encountered.
Results are returned to the controller by trickling back up the linear
network in the opposite direction to the flow of work.


Figure 1.1 Linear Model Workfarm. Diagram labels: TRAM (controller), workfarm.


The sensitivity of a given topology is analyzed by altering various 
parameters within the algorithm which might affect the performance 
of the workfarm. Parameters which may be altered to observe this 
effect include the input buffersize of each worker (how many 
workpackets a worker may buffer before having to pass additional 
workpackets on), the size of the workpackets (how many iterations 
must be done per workpacket), etc. The results presented in this 
thesis indicate that buffersize and the size of the workpackets 
(stripsize) are limiting factors in the linear model workfarm. In 
addition, two modes of operation for the linear model have been 
studied, non-addressed and addressed. In both cases the controller 
sends a new workpacket out to the network each time it receives
results from a previously processed workpacket. In the non-addressed
mode the first available worker encountered by the
workpacket will grab the workpacket for processing or buffering, if
sufficient space is available. In the addressed mode, on the other hand,
the controller sends the workpacket directly to the specific
worker in the chain which just returned a results packet. Each of
these designs may have its own advantages.


II. THE TRANSPUTER 


A. OVERVIEW

The T800 [Ref. 2] transputer is just one in a family of transputers 
produced by INMOS. The parallel architecture, augmented with the 
CSP-based (Communicating Sequential Processes) [Ref. 3] language, 
OCCAM [Ref. 4], makes it an ideal and inexpensive tool to conduct 
research in topics of concurrency. Due to the serial link
intercommunication structure of the transputer, a variety of network
topologies can be easily implemented. Prior to discussing these
topologies, a greater understanding of transputer architecture is
required.


B. ARCHITECTURE

The T800 is a single chip implementation of what is traditionally a 
separate microprocessor and several support chips. Figure 2.1 shows 
a simple breakdown of the T800 architecture. Transputer regions are 
subdivided into a RISC technology microprocessor, on-chip RAM, four 
paired input/output serial links, an external memory interface module 
and a systems services module. In addition the T800 model
incorporates an on-chip floating point unit.

Figure 2.1 T800 Transputer Block Diagram


1. Processor


The processor is a RISC machine containing instruction logic, 
an instruction pointer, a workspace pointer, an operand register and 
three source/destination stack-like registers. All registers are 32 bits
long. Four Gbytes of memory may be addressed. The first 4 Kbytes of
this address space are on-chip RAM. The various registers have the
following functions:

• workspace pointer - points to an area of memory containing
local variables
• instruction pointer - points to the next instruction to be
executed
• operand register - used to form instruction operands
• A, B, C registers - evaluation stack


a. Instructions 

Instructions refer to the stack implicitly. For example, 
evaluation of an add instruction adds the contents of A to the contents 
of B and places the sum in A. Overflow protection is not provided in 
hardware as this is easily handled by the compiler. 

All instructions in the instruction set have the same format
and represent the operations most commonly used in
programs. Each instruction is a single byte which can be
decomposed into two 4 bit segments. The upper 4 bits contain the 
function code and the 4 lower order bits are a data value. This alone 
limits the number of functions to 16. However, most program 
operations involve the loading of small literal values and the loading 
and storing of one of a small number of variables. Two of the 16 
functions, prefix and negative prefix, provide for extending the length
of the instruction operand. The prefix instruction first loads its 4 data
bits into the operand register and then shifts this value to the left 4
bits. The negative prefix instruction merely complements the
operand register prior to executing the shift. This scheme can create
operands in the range of -256 to 255 by using just one of the
appropriate prefix instructions.
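
The operand-building scheme can be checked with a short model. The sketch below is written in Python for illustration only; the function name and the tuple representation of an instruction are assumptions, and the mnemonics pfix, nfix and ldc are used simply as labels for the prefix, negative prefix and an ordinary instruction. It follows the rule stated above: every instruction places its 4 data bits in the low bits of the operand register, prefix then shifts left by 4, and negative prefix complements before shifting.

```python
def decode_operand(instructions):
    """Return the operand seen by the first non-prefix instruction.

    `instructions` is a list of (mnemonic, data_nibble) pairs, an
    illustrative representation rather than real transputer byte code.
    """
    oreg = 0  # operand register, cleared after each complete instruction
    for mnemonic, data in instructions:
        if not 0 <= data <= 0xF:
            raise ValueError("data field must fit in 4 bits")
        oreg |= data                    # load the 4 data bits
        if mnemonic == "pfix":
            oreg <<= 4                  # prefix: shift left 4 bits
        elif mnemonic == "nfix":
            oreg = (~oreg) << 4         # negative prefix: complement, then shift
        else:
            return oreg                 # ordinary instruction consumes the operand
    raise ValueError("sequence ended on a prefix instruction")

# All operands reachable with exactly one prefix instruction.
one_prefix = {
    decode_operand([(p, hi), ("ldc", lo)])
    for p in ("pfix", "nfix")
    for hi in range(16)
    for lo in range(16)
}
print(min(one_prefix), max(one_prefix))
```

Enumerating every combination of one prefix and one data nibble reproduces the range quoted in the text, which is a useful sanity check on the description of the two prefix functions.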


b. Concurrency 

The fundamental programming structure is known as a 
process, which is simply a sequence of instructions. A single 
transputer can run concurrent processes independent of a network of 
transputers. This is allowed by multiplexing high and low priority 
processes. Low priority processes are run whenever high priority 
processes are idle or are waiting for communications. Typically high 
priority processes are of short duration while most low priority 
processes are of longer durations. The user defines which processes
run at high priority and which run at low priority; this user-defined
scheme eliminates the need for a kernel.


2. FPU 
The T800 also houses a 64 bit floating point unit. The FPU 
performs single and double length arithmetic conforming to the floating
point standard ANSI-IEEE 754-1985. This FPU is capable of
sustaining 2.25 MFLOPS processing concurrently with the CPU on a 
30 MHz transputer model. However, for this thesis the T800 was 
operated at 20 MHz. 


3. Memory 
The T800 is configured with 4KBytes of on chip static 
random access memory. This memory serves as the lowest address 
block for the 4 GBytes of memory addressable by the T800. The 
remainder of the 4 GBytes must be supplied as external memory via a 
32 bit bus. The 4KBytes of on chip RAM may be accessed via the 32
bit internal bus for read/write operations in one clock cycle.


4. Serial Links 
The four pairs of input/output serial links provide the means 
for building networks of processors. These links allow the 
implementation of CSP by providing a means to make direct
communications channels between processors within the network.


5. Timers 
There are two hardware timers within the T800 which
operate at two distinct priority levels. The high priority timer provides
1 µsec ticks for the system whereas the low priority timer produces
ticks at 64 µsec intervals.


III. WORKFARM TOPOLOGY


A. Overview 

The workfarms to be investigated in this thesis required the ability 
to be easily configured into a variety of network topologies. The T800 
transputer and its programming language, OCCAM, were chosen 
because of the simplicity in implementing multiprocessor networks 
with them. As previously stated, all parallel processing structures have 
advantages and disadvantages specific to their design. For this reason 
two different topologies have been studied in this thesis to determine 
their efficiencies relative to a given independent algorithm as the test- 
bed. The two topologies chosen were the linear model and the tree 
model. Since the goal was to determine the most efficient model for 
the given independent algorithm, each model was tested by varying 
specific parameters within the algorithm which might have influence 
on the interprocessor communications and actual time spent doing 
calculations. In each case the desired goal was to minimize the time 
to complete a given batch of work and to observe the resulting load 
balance for all workers in the network. The models are looked at in 
more detail in the following sections. 

Given a sufficient number of transputers, imagination is the only 
limit to the number and design of possible topologies. However, 
practicality from a hardware and software implementation standpoint 
would argue that more regular and symmetrical structures be
investigated. With regard to these considerations, two other
topologies were of interest given the number of transputers available
for this network study. However, due to time considerations, a loop
model and a cube model could not be implemented. A basic
discussion of the loop and cube topologies is included in the next
section, but it is limited to hardware configuration only.


B. Models 
1. Linear 

The linear model is a straightforward application of the
transputers located on the B003 boards and the TRAM. A block
diagram of the linear topology is shown in Figure 1.1. The model is
implemented by simply connecting the transputers in a one
dimensional array and ensuring that the appropriate link connections
are defined in the software. A diagram of the linear workfarm model
is shown in Figure 3.1.

Figure 3.1 Linear Model Physical Link Diagram





The arrows depict the direction of data flow along the links. Note that 
these are not bidirectional links, but rather both the in and out 
complements for each link are used as a pair to provide output and 
input communications. Workpackets are generated in the controller 
and sent to the network via the outward flow of the links. Results are
passed back up the links to the controller in the other direction.


2. Tree 
The tree model is a more significant deviation from a simple 
linear design topology. The TRAM has three pairs of output and input 
links to choose from for connecting to the workfarm network. For the 
tree model, two of these link pairs are used. Each link pair connects
to one of the B003 boards. The root worker of the B003 board already
has links to its two orthogonally adjacent neighbors and a simple
hardwire connection is made from its fourth available link pair to the
diagonally positioned worker on the board. This configuration creates
two branches from the TRAM with each branch splitting into three
more branches. The configuration is shown abstractly in Figure 3.2
and schematically in Figure 3.3.


Figure 3.2 Tree Model Block Diagram. Diagram label: controller.

Figure 3.3 Tree Model Physical Link Diagram





3. Loop 

The loop topology is a direct derivative of the linear model. 
The data flow is unidirectional instead of bidirectional as in the linear
model. Figure 3.4 depicts the loop model in a block diagram.
Workpackets flow from the controller to the network. The results
continue to flow in the same direction as workpackets and are
transmitted by the last worker in the chain back to the controller via a
separate link. Figure 3.5 is a proposed hardware connection of the
T800 transputers.


Figure 3.4 Loop Model Block Diagram. Diagram labels: TRAM (controller), workfarm.

Figure 3.5 Loop Model Physical Link Diagram





4. Cube 
The cube is the most complicated of the four models. It
combines features of the tree and the loop topologies. Figure 3.6 is an
abstract diagram of the nodes and the link connections. Output from
the controller would flow to the first worker where the workpackets 
(and results) could be forwarded to one of the three adjacent nodes 
along three different channels depending on availability. Each of these 
three adjacent nodes could in turn forward the workpackets (and 
results) to two other nodes, as shown in the figure. Ultimately all 
results will converge on the last worker in the cube which sends the 
results on to the controller via a separate link. 


Figure 3.6 Cube Model Block Diagram. Diagram label: TRAM (controller).


C. Algorithm


1. General Algorithm Structure for All Topologies 

As mentioned in the previous section, there are several 
possible topologies in organizing a workfarm, e.g., linear, loop, tree, 
cube, and many others. The linear and tree topologies were 
investigated in this project. To test these topologies, an algorithm was 
designed to produce independently processable workpackets. Each 
topology employs a similar set of processes to accomplish the task. 
The processes implemented may differ slightly due to the 
communication requirements in the different topologies. Process 
names will be in bold print for the remainder of this paper. Common 
to each of the workfarms is the controller. The purpose of the 
controller is to coordinate the various processes necessary to 1) 
generate work based on the number of strips the graphics screen is 
subdivided into, 2) send the work out to the network of transputers 
for processing, 3) receive the results from the network and 4) 
monitor the time it takes to process each batch of work. A general
model of the controller is shown in Figure 3.7.

Figure 3.7 Linear Model Controller Process. Diagram labels: controller, gen.work, work.


The complete boxed diagram represents the controller 
process. This is loaded specifically into the root transputer. Each of
the circled processes is part of the overall process and they run in
virtual parallelism. In other words, they are not truly running in
parallel. The four processes are time sliced based on their priority 
levels. Since they are all specified to run at high priority, they each 
receive equal processing time. Therefore, no one process receives 
priority over any other. 

The arrowed lines between processes indicate designated 
software channels between those processes along which data may be
passed. Arrows extending beyond the bounds of the controller signify
channels being routed along physical links to and from other
hardware. The work and results channels are used to communicate
with the network of workers while the to.graph and from.graph
channels are used for sending and receiving data between the display
and external filing hardware.


2. Linear Algorithm 

The workfarm is where the workpackets are processed. The 
process which performs the calculations is consistent in each worker 
and in each topology. Any differences in the algorithm loaded into the 
network workers are due to communications requirements which 
depend on the specific topology being employed. It is also important 
to note that since more than one layer of workers is being used in 
every workfarm, physical links must be established between upper 
level workers and lower level workers in the hierarchy so that 
workpackets may be communicated to all workers in the farm. The 
process to be loaded into the linear workfarm is pixel.gen and is
shown in Figure 3.8.

Figure 3.8 Linear Model Pixel.gen Process. Diagram labels: pixel.gen, work.out, results.out.


The channels labelled work.in, work.out, results.in and
results.out are along physical links and they are connected either to
the TRAM (in the case of the first worker in the farm) or to other
adjacent workers. The same discussion for processes internal to
controller applies to the process pixel.gen. Note that pixel.gen has two
levels of subprocesses. Processes router, gen.pixel and bypass are all
at the same level while processes my.buffer, buffer and mix are all one
level deeper.

The linear workfarm is implemented in two schemes. The
first scheme is a non-addressed mode in which the controller sends
workpackets out to the network without specifying which worker is
going to process the workpacket. The second scheme uses an
addressed mode in which each workpacket contains the destination 
worker's unique address within the array. This allows a new 
workpacket to go directly to the worker that has returned a 
completed workpacket rather than to an opportunistic worker closer 
to the controller as the non-addressed mode would allow. 

The basic operation of the algorithm is as follows. Controller 
contains work.valve, display, gen.work and stopwatch. These four sub- 
processes are running in parallel within controller. Display initializes 
the network of workers, via gen.work, with initialization data 
necessary to carry out the processing of the workpackets (initialization 
phase). This is sent from the controller via the channel annotated 
work.out. This output is fed to each of the workers by entering the 
first worker in line via its input channel work.in. Once a worker has 
been initialized with the appropriate data it will signal back to 
work.valve to send workpackets. Each worker will first get one 
workpacket to be processed by gen.pixel, then will fill its buffer (store, 
within router, is not visible in these figures) to the limit set by the 
variable buffer.size. When this task has been completed, this worker is 
full and any more initial workpackets will be forwarded to the next 
worker in line via channel work.out from router and channel work.in 
of the next worker. This process is repeated until all of the initial 
batch of workpackets has been sent to the network. Because the linear
model has been implemented using two schemes, non-addressed and
addressed, there are two methods for sending workpackets out
to the array of workers. An integer value in display, called address, is
set to either 999 or some other number. When address is set to 999,
the non-addressed mode is used for workpacket distribution. Otherwise,
workpacket distribution is done according to the addressed mode.
During the distribution of the initial batch of workpackets for the non-
addressed mode, a workpacket is shipped without regard to which
worker will receive it. The first worker with an available buffer
opening will absorb that workpacket. The number of workpackets to
be sent out during this phase equals the number of workers in the
array multiplied by the buffersize plus one. With a standard number of
eight workers in the array, the number of workpackets initially sent
would be:

#workpackets = 8*(10+1) = 88

where:  8 = number of workers
       10 = possible buffersize
        1 = workpacket for processing
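
The initial non-addressed distribution can be modelled in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that no worker finishes a packet while the initial batch is still being sent, which the thesis notes may not hold in practice. Each packet travels down the chain and is absorbed by the first worker that still has room for it.

```python
def initial_batch_size(n_workers, buffer_size):
    """Workers times (buffersize + 1), as in the formula above."""
    return n_workers * (buffer_size + 1)

def distribute_non_addressed(n_packets, n_workers, buffer_size):
    """Return how many initial packets each worker in the chain absorbs."""
    capacity = buffer_size + 1          # one being processed plus a full buffer
    held = [0] * n_workers
    for _ in range(n_packets):
        for w in range(n_workers):      # travel away from the controller
            if held[w] < capacity:
                held[w] += 1            # first worker with room takes it
                break
        else:
            raise RuntimeError("no worker has room for this packet")
    return held

batch = initial_batch_size(8, 10)
print(batch, distribute_non_addressed(batch, 8, 10))
```

Under the stated assumption every worker ends the initial phase holding the same number of packets. Once workers start completing packets during distribution, those nearest the controller can absorb more than this even share.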

In the non-addressed mode, it is conceivable that the first couple of
workers will begin processing their buffered packets prior to all of the
workers receiving an equal percentage of the initial allotment.
Therefore, until processing overtakes the first processors, they may
accept larger percentages of the initial workpacket allotment. As soon
as the first processor completes a packet, it will return the results via
my.buffer and mix. The results are then passed from pixel.gen to
controller via the channels results.out and results.1 respectively. Once
in controller, work.valve will send the results on to display.

In the addressed mode, work.valve sends the initial
workpackets to the workfarm deterministically. Work.valve sends the 
initial workpackets as bundles to each worker in the array so every 
worker receives exactly the same initial number of workpackets. The 
number of workpackets in each bundle is the same and is equal to
the current buffersize plus one, i.e.,

#workpackets/worker = 10+1 = 11

where: 10 = possible buffersize
        1 = workpacket for processing


Consequently with eight workers in the workfarm the total number of 
packets shipped initially in this example would be 88. The major 
difference here is that each worker will get the same number of initial 
packets. 
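
A minimal Python sketch of the addressed-mode bookkeeping follows. It is not the thesis's OCCAM code: the function names are hypothetical, worker addresses are assumed to run from 0 upward, and the only value taken from the text is 999 as the setting of address that selects the non-addressed mode. The first function lists the destination address of every initial packet, and the second chooses the destination for the packet sent in response to a returned result.

```python
from collections import Counter

NON_ADDRESSED = 999  # value of `address` selecting non-addressed mode (from the text)

def initial_bundles(n_workers, buffer_size):
    """Destination address of each initial packet: buffersize + 1 per worker."""
    per_worker = buffer_size + 1
    return [addr for addr in range(n_workers) for _ in range(per_worker)]

def refill_address(mode_address, result_source):
    """Destination for the packet sent after `result_source` returns a result."""
    if mode_address == NON_ADDRESSED:
        return NON_ADDRESSED            # any worker with room may take it
    return result_source                # straight back to the worker that finished

bundles = initial_bundles(8, 10)
print(len(bundles), Counter(bundles)[0], refill_address(0, 5))
```

The design point is that in the addressed mode a worker far from the controller is refilled as soon as it reports a result, instead of competing with workers nearer the controller for the next packet.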

The elapsed time for each batch of work was measured by
triggering stop.watch to get a start time at the end of the initialization
phase. Then after work.valve receives the results from the last 
workpacket, stop.watch is triggered again to get the stop time. The 
elapsed time is the difference between the stop time and the start 
time and represents the total amount of time taken to process all of 
the workpackets after the initialization phase for the network. 
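
As an illustration of the arithmetic, the Python sketch below converts two timer readings into an elapsed time in seconds. Several points are assumptions rather than details given in the thesis: that stop.watch reads the high priority timer with its 1 µsec tick, that the reading is a 32-bit counter that wraps around, and the function name itself. The subtraction is taken modulo 2^32 so that a single wraparound between the two readings does not corrupt the result.

```python
TICK_SECONDS = 1e-6        # assumed: high priority timer, 1 microsecond per tick
MASK_32 = 0xFFFFFFFF       # assumed: 32-bit wrapping tick counter

def elapsed_seconds(start_ticks, stop_ticks, tick_seconds=TICK_SECONDS):
    """Difference between two timer readings, tolerant of one wraparound."""
    ticks = (stop_ticks - start_ticks) & MASK_32   # subtraction modulo 2**32
    return ticks * tick_seconds

print(elapsed_seconds(1_000, 2_501_000))           # about 2.5 seconds
print(elapsed_seconds(0xFFFFFF00, 0x00000100))     # counter wrapped in between
```

With a 1 µsec tick this approach is only valid for intervals shorter than 2^32 ticks, roughly 71 minutes. With the 64 µsec low priority tick, pass tick_seconds=64e-6 instead.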

Several parameters are important in this workfarm. One of 
the most important parameters is the variable stripsize. One goal 
might be to section a monitor screen up into parts (or strips) and then 
randomly generate a color for each pixel (one byte for each pixel) of 
that strip. Reducing the stripsize reduces the number of pixels (or
bytes) that need to be calculated in that strip. Consequently the
processing time for a strip decreases along with stripsize. However,
the other side of this tradeoff is that as the stripsize shrinks, the
maximum number of strips required to complete the screen increases.
These two variables are inversely proportional. In order to calculate
the color for the screen randomly, a random number generation
library routine is called. It is actually called twice, once in gen.work
and then again in gen.pixel. Both results are used in gen.pixel. The
value calculated by gen.pixel is compared against the value forwarded
by gen.work. If the former is not within plus or minus the
window.width of the latter, then gen.pixel recalculates a random
number and does the comparison again. This procedure causes each
workpacket to be processed for a different period of time. Once a
match has been made within the constraint of the window.width, the
random number created by gen.pixel is returned with the results. The
reason for replacing the random number value with the one calculated
in gen.pixel is to scale the random number to a value between 0 and
255. Since there are only 256 possible colors per pixel, the calls
made by gen.pixel must produce results within the range of 0 to
255. As the window.width requirement becomes more
constrained, the process takes longer to converge and
processing time increases.
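
The retry loop that gives each workpacket an irregular processing time can be sketched as follows. This is a Python illustration under assumptions, not the thesis's OCCAM routine: the function names are hypothetical, colors are drawn directly in the range 0 to 255 rather than scaled from a library random number, and the window comparison is taken to be inclusive.

```python
import random

def gen_pixel(target, window_width, rng):
    """Draw random colors until one lies within +/- window_width of target.

    Returns (color, draws); `draws` stands in for the processing time.
    """
    draws = 0
    while True:
        candidate = rng.randrange(256)          # color value in 0..255
        draws += 1
        if abs(candidate - target) <= window_width:
            return candidate, draws

def process_strip(targets, window_width, rng):
    """Process one workpacket: one accepted color per target byte."""
    return [gen_pixel(t, window_width, rng)[0] for t in targets]

rng = random.Random(0)                          # seeded for repeatability
strip = process_strip([100, 20, 200, 128], 5, rng)
print(strip)
```

A narrow window rejects most draws, so the number of iterations per pixel varies widely from one packet to the next. A window of 255 accepts the first draw every time and removes the irregularity.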

Each successive run of the workfarm halves the stripsize.
The maximum stripsize is 128 bytes and the minimum stripsize is 1
byte. Each reduction in stripsize effectively doubles the maximum
workload. It also reduces the amount of processing time for each
workpacket because each successively smaller stripsize contains only
half as many bytes as the previous, larger stripsize. By reducing the
processing time per workpacket, the rate at which workpackets and
results will be transmitted increases. It is possible that the first few
workers will become a communications bottleneck while attempting
to cope with the demands of more distal workers. This is because
more distal workers have less of a requirement for communications
due to their positioning within the network.
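
The relationship between stripsize and the number of workpackets can be tabulated with a short Python sketch. The function names and the screen size of 1024 bytes are hypothetical, chosen only to make the doubling visible; the thesis does not state the screen dimensions in this section.

```python
def stripsize_schedule(max_size=128, min_size=1):
    """Stripsizes used on successive runs: halved from max_size to min_size."""
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    sizes = []
    size = max_size
    while size >= min_size:
        sizes.append(size)
        size //= 2
    return sizes

def strips_needed(screen_bytes, stripsize):
    """Number of strips (workpackets) needed to cover the screen."""
    return -(-screen_bytes // stripsize)        # ceiling division

SCREEN_BYTES = 1024                             # hypothetical screen size
for size in stripsize_schedule():
    print(size, strips_needed(SCREEN_BYTES, size))
```

Each halving of the stripsize doubles the packet count while halving the work per packet, so the total computation stays roughly constant and the number of messages that must pass through the workers nearest the controller grows.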


3. Tree Algorithm 

The tree topology required some additional communication 
considerations that were not required in the linear model. Since the 
tree topology is a branching structure, allowances had to be made for 
communications to follow these branches freely in the non-addressed 
mode. Workpackets need to be able to go to any worker that is 
available. The algorithm was basically the same as the pixel.gen used 
in the linear algorithm with additional branching communication 
channels. The controller was modified to handle two output channels 
and two input channels to the network as shown in Figure 3.9. A new
workpacket was sent to the specific branch on which the last results
packet was received.

Figure 3.9 Tree Model Controller Process

There were two modified versions of pixel.gen. The first,
pixel.gen2, was modified to handle three input and output channels as 
required by the second level workers. In addition, pixel.gen2 was also 
modified to allow a second level worker to reroute any workpacket 
among the third level workers in the non-addressed mode. This 
allowed a workpacket to be cycled through the third level workers 
until one that could process or buffer it was found. Figure 3.10 shows
pixel.gen2 with the channel, reroute, which allows the rerouting of
the workpackets as just discussed.
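
The rerouting idea can be sketched in Python as follows. The function name and the representation of each third level worker as a count of free slots are assumptions made for illustration; the actual mechanism passes workpackets along OCCAM channels. The second level worker offers the packet to each third level worker in turn and stops at the first one that can process or buffer it.

```python
def reroute(free_slots, start=0):
    """Offer a packet to each third level worker in turn.

    `free_slots[i]` is the remaining capacity of worker i. Returns the index
    of the worker that accepts the packet, or None after one full cycle.
    """
    n = len(free_slots)
    for step in range(n):
        i = (start + step) % n
        if free_slots[i] > 0:
            free_slots[i] -= 1          # worker i takes the packet
            return i
    return None                         # every worker was full on this cycle

slots = [0, 2, 0]
print(reroute(slots), slots)
```

In this sketch a packet that finds every worker full is reported back after a single cycle. The description above suggests the real packet keeps cycling until space appears.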

Figure 3.10 Tree Model Second Level Worker Pixel.gen2 Process





The third level workers did not require any additional output
channels for branched workers, so pixel.gen3 was not modified to handle this. If
additional workers were to be appended to the third level workers,
then pixel.gen3 could be modified in the same way as pixel.gen2 to allow for
communications. Pixel.gen3 does include a channel, reroute, which
will send an unprocessed workpacket back up to the second level
worker for redistribution to another third level worker. Pixel.gen3 is
shown in Figure 3.11.

Figure 3.11 Tree Model Third Level Pixel.gen3 Process





IV. RESULTS 


A. Linear Topology 

The load balancing results [Ref. 5], produced by varying the 
stripsize and buffersize parameters, are graphically depicted in 
Appendix B and Appendix C. Appendix B contains the results for the 
non-addressed mode operation of the linear topology. Appendix C 
contains the results for the addressed mode operation. The graphs 
are broken down by stripsize with workpackets done per worker 
versus buffersize. Examining these graphs shows a wide range of load 
balancing. Large stripsizes corresponding to larger workpacket sizes 
resulted in more even load balancing for both the addressed and non- 
addressed modes of operation. For the large stripsizes, buffersize did 
not seem to have a significant effect on the load balancing. As the 
stripsize was decreased below 32, the load balancing became less 
symmetric with the most distal workers in the network taking on less 
work with small buffersizes. The load balancing appears to smooth out 
as the buffersize increases with stripsizes down through 8, but careful 
inspection of the actual loads shows that the further the worker is 
from the controller, the greater the load it will carry in terms of 
workpackets processed. This can be attributed to the decreased 
processing time associated with smaller workpackets. Decreasing the
workpacket size reduces the processing time required by any given
worker. This results in a communications buildup that increases the
closer a worker is to the controller. Those workers closest to
the controller become inundated with communication demands to
serve workpackets and receive results from the most distal workers
because of the faster processing time due to smaller packets. For the
non-addressed mode, serving workpackets to the network that are
smaller than 16 bytes in length results in severely unbalanced loading.

The addressed mode provides a slight advantage over the non-
addressed mode in that there is a point of convergence for load
balancing given any stripsize. A buffersize of 1 produced a reasonable
load balance for all workers in the network regardless of stripsize.
Distributing workpackets by request to the workers, rather than by the
first-available method used in the non-addressed mode, resulted in the
pattern seen in Appendix C.


B. Tree Topology 

Due to the bilateral topology of the tree, the load balancing results, 
as shown in Appendix D and Appendix E, were as expected for large 
values of stripsize. The load was shared evenly between both main 
branches of the tree. In addition the tertiary level workers shared 
identical loads within a single branch. The tertiary workers in one 
main branch also handled exactly the same amount of work as the 
tertiary level workers in the other main branch. Both workers in the
second level handled the same amount of work, but processed fewer
packets than any given tertiary worker. This was expected, as their
position in the network required them to mediate communications
between the controller and the tertiary workers.

Both the non-addressed and addressed modes responded similarly
to the decrease in stripsize. The load balancing remained symmetric
through a stripsize of sixteen, at which point it became
asymmetrical. Because smaller packets are processed faster, whichever
branch gained control of the initial workpackets first continued to
absorb most of them for the remainder of a batch of work.


C. Comparison of Topology Performance 

To compare the two topologies, timing results were obtained for
both the linear and tree models in both operating modes over a range
of workpacket sizes. Table 4.1 contains sample results for each of
these categories. These figures were obtained for a buffersize equal
to 2 in all cases. Comparing the linear model with the tree for any
given mode and stripsize, it is clear that the tree has a temporal
advantage in processing a complete batch of work. Distributing work
to the network deterministically using the addressed mode resulted in
quicker processing times than the non-addressed mode. This was due to
less processing time being required to determine if a worker could
handle the work. The deterministic method merely checked to see if a
worker was the destination worker. If not, the workpacket was passed
on. For example, with a large stripsize (128 bytes), the tree topology
using the addressed mode is 0.013273 seconds faster than the linear
model using the non-addressed mode.
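
The per-packet decision described above can be sketched as follows. This is a Python paraphrase of the branching in PROC router (Appendix A), not the occam code itself; the function name `route` and the string return values are illustrative assumptions.

```python
NO_ADDRESS = 999  # address carried by packets in the non-addressed mode


def route(address, my_address, busy, buffer_full):
    """Return 'process', 'store', or 'forward' for an incoming workpacket."""
    if address == NO_ADDRESS:
        # Non-addressed mode: the worker must decide whether it can take
        # the packet, buffer it, or pass it down the network.
        if not busy:
            return "process"
        if buffer_full:
            return "forward"
        return "store"
    if address == my_address:
        # Addressed mode, packet is for this worker.
        return "process" if not busy else "store"
    # Addressed mode, packet is for another worker: pass it on at once.
    return "forward"


print(route(NO_ADDRESS, 300, busy=True, buffer_full=True))
print(route(500, 300, busy=False, buffer_full=False))
```

In the addressed mode a packet for another worker is forwarded after a single comparison, which is consistent with the shorter times reported for that mode.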



            Linear (s)    Tree (s)
    8       3.277614      3.222384
    [remaining rows of the table are not legible in this copy]

Table 4.1 Comparison of Linear Model and Tree Model Timing Results

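
As a small worked example, the relative advantage of the tree can be computed from the one legible row of Table 4.1 (3.277614 s for the linear model, 3.222384 s for the tree). The Python helper below is illustrative only; the name `relative_speedup` is hypothetical, and the operating mode for this row is not recoverable from this copy.

```python
def relative_speedup(slow_seconds, fast_seconds):
    """Fraction of the slower time saved by the faster configuration."""
    return (slow_seconds - fast_seconds) / slow_seconds


linear, tree = 3.277614, 3.222384  # the one legible row of Table 4.1
print(f"{relative_speedup(linear, tree):.2%}")
```

The saving for this row is under two percent, which fits the observation that the speed difference may not matter for many applications.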
Although the linear model was simpler to set up, the tree topology
did have some advantages which might make it more useful. In terms
of timing, the tree topology was faster than the linear model. For
many applications the difference in speed between the two topologies
may not be of any consequence to the user. However, for applications
where every additional margin of speed is of the essence, a tree-type
model may prove advantageous by creating the shortest
communication distances, with the fewest intermediate nodes
between a root controller and the most distal workers in the network.
The linear model could potentially become bogged down in
communication delays as increasing numbers of workers are arranged
in that topology.
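
The communication-distance argument can be illustrated with a short calculation of the longest controller-to-worker path in each topology. This Python sketch is not from the thesis; the function names are hypothetical, and the fan-out values in the example (two branches at the controller, three children per worker) are assumptions chosen only for illustration.

```python
def max_hops_linear(n_workers):
    """Links between the controller and the farthest worker in a chain."""
    return n_workers


def max_hops_tree(n_workers, root_fanout, fanout):
    """Links between the controller and the deepest worker when workers
    fill a tree level by level."""
    depth, level_size, placed = 0, root_fanout, 0
    while placed < n_workers:
        depth += 1
        placed += level_size
        level_size *= fanout
    return depth


n = 8
print(max_hops_linear(n), max_hops_tree(n, root_fanout=2, fanout=3))
```

The chain's longest path grows linearly with the number of workers, while the tree's grows only with its depth.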

Load balancing considerations make the tree a better model for
processor utilization. A greater percentage of processors in the
tree network were doing useful processing as compared with the linear
model. This was particularly evident as the stripsize decreased below
thirty-two.
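
One way to quantify "percentage of processors doing useful processing" from per-worker packet counts is sketched below in Python. The helper names are hypothetical and the packet counts in the example are made up for illustration; they are not measurements from this thesis.

```python
def utilization(packets_per_worker, threshold=0):
    """Fraction of workers that processed more than `threshold` packets."""
    active = sum(1 for p in packets_per_worker if p > threshold)
    return active / len(packets_per_worker)


def imbalance(packets_per_worker):
    """Largest worker load divided by the mean load (1.0 is perfectly even)."""
    mean = sum(packets_per_worker) / len(packets_per_worker)
    return max(packets_per_worker) / mean


even = [256] * 8                                # hypothetical balanced batch
skewed = [0, 0, 0, 0, 512, 512, 512, 512]       # hypothetical unbalanced batch
print(utilization(even), imbalance(even))
print(utilization(skewed), imbalance(skewed))
```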

V. CONCLUSIONS AND DISCUSSION

The difficulties encountered centered on debugging. The
Transputer Development System (TDS) that was used did not provide
any debugging facilities. Therefore, attempting to localize bugs in the
network was a time-consuming and not altogether enjoyable task. The
controller level of each of the networks was simpler to debug because 
it interfaced directly with the monitor controlling boards. However, 
the network of workers was essentially invisible to debugging except 
at the source code level. Possible alternatives to working in the 
Transputer Development System would be using the Profiler [Ref. 6] 
and NDB (network debugger) [Ref. 7] produced by Parasoft 
Corporation. The Profiler is a performance monitor composed of an 
Execution Profiler, a Communication Profiler, and an Event Profiler. 
The Execution Profiler monitors time spent in individual routines, the 
Communication Profiler evaluates the time spent in communications 
and I/O, and the Event Profiler shows interactions between processors 
and allows user-specified events to be monitored. NDB is a symbolic 
source and assembly level debugger for parallel computers. Using this 
tool it is possible to determine how far a program has run. However,
the problem of not being able to monitor the network completely
would remain.

The tree network, running in the non-addressed mode, did not
execute any batches of work beyond the first two at a stripsize of
eight. Due to time constraints and the inherent difficulties with
debugging, this problem remains to be resolved. Data from this
configuration with stripsizes ranging from 128 to 16, when compared to
data from the other network configurations, appeared to be accurate
and was considered useful.

The transputer has an inherent simplicity based on the CSP 
philosophy which makes it a valuable tool for workfarm research. 
Many topologies, such as a cube or loop, can be investigated quickly 
and simply using this device. Additional research should investigate
shared memory schemes to enhance the utility of this device in
problems requiring shared databases. In addition, more effort should
be applied towards the production of debugging tools for
multiprocessor networks.

Another approach to debugging small networks of transputers,
i.e., fewer than 16, could lie in the CSP design philosophy of the
transputer itself. A debug-specific hardware board could be designed
which taps one link to each transputer in the network, thus enabling a
debug program to monitor each of the processes executing in any or
all of the transputers in the network connected to the debug
hardware. Obviously this would reduce the connectivity of each of the
transputers in the network to some degree. However, in the interest
of providing more debugging capability and hence more robust
programs, this may be an area worth investigating.




APPENDIX A 


DETAILED SOURCE CODE - LINEAR TOPOLOGY 


-- link definitions

VAL link0out IS 0:
VAL link1out IS 1:
VAL link2out IS 2:
VAL link3out IS 3:
VAL link0in  IS 4:
VAL link1in  IS 5:
VAL link2in  IS 6:
VAL link3in  IS 7:
-- declarations 

VAL numT8 IS 8:
VAL numT4 Sse 
VAL numTs IS numT8+numT4:


-- channel declarations

CHAN OF ANY to.graph,from.graph:
CHAN OF ANY graph.to.mouse,mouse.to.graph:
CHAN OF ANY to.net,from.net:
CHAN OF ANY endloopT8:
CHAN OF ANY to.time,from.time:

[numT8+numT4] CHAN OF ANY results:
[numT8+numT4] CHAN OF ANY work:
-- VAL assignments


VAL work.in 1S =o4,6,.6710,-0, 0, 6,400 
VAL work.out ES (37373703375) cm 
VAL results.in ES [Tie Ae ea ee) 
VAL results.out mon) ([Op2Z 25 2p ee) 


PROC controller (CHAN OF ANY to.graph, from.graph, work, results)
  --Process which provides control for workfarm. Contains four internal
  --parallel processes; PROC work.valve, PROC gen.work, PROC display, and
  --PROC stop.watch.

  #USE "raycom.tsr"

  #USE "\misclib\grafsymb.tsr"

-- declarations 

CHAN OF ANY new.work,init,to.display, from.display: 

CHAN OF ANY to.time, from.time: 

CHAN OF ANY report: 


  PROC work.valve (CHAN OF ANY results,new.work,
                   to.time,from.time,to.display,from.display,report)
    --PROC work.valve monitors the network for incoming results which it
    --sends to PROC display. At the beginning and end of each batch of
    --work, PROC work.valve gets the current time from PROC stop.watch.
    --At the end of a batch of work, PROC work.valve computes the elapsed
    --time the network took to process that batch of work.
    -- declarations

    #USE "raycom.tsr"
    INT32 seed:
    INT64 compares:
    INT command, maxwork:
    INT num, packets, buffer.size:
    INT time, lastT:
    INT address, j:
    INT x, y:
    INT rmax:
    INT stripsize:
    INT randnum, workdone:
    INT window.width:
    INT iters:
    [128] BYTE pixels:
    BOOL active:


    SEQ
      active := TRUE
      WHILE active
        ALT
          results ? command
            IF
              command = c.result
                SEQ
                  results ? x;y;[pixels FROM 0 FOR stripsize];address
                  to.display ! c.result;x;y;
                    [pixels FROM 0 FOR stripsize]
                  new.work ! address
                  workdone := workdone + 1
                  IF
                    workdone = maxwork
                      SEQ
                        to.time ! TRUE
                        report ! TRUE
                    TRUE
                      SKIP
              command = c.report
                SEQ
                  from.time ? time
                  to.display ! c.time;time
              command = c.report.data
                SEQ
                  results ? packets;compares;num
                  to.display ! c.report.data;packets;compares;num
              command = c.init
                SEQ
                  results ? seed;window.width;rmax;
                    stripsize;lastT;buffer.size
                  to.time ! TRUE   -- start the timer
                  IF
                    address = 999
                      SEQ i = 0 FOR iters
                        new.work ! address
                    TRUE
                      SEQ i = 0 FOR iters/(buffer.size+1)
                        SEQ j = 0 FOR buffer.size
                          new.work ! (i+1)*100
              command = c.test
                SEQ
                  results ? command
                  to.display ! c.test;command
              TRUE
                SKIP
          -- initialization sequence
          from.display ? stripsize;lastT;address;buffer.size
            SEQ
              iters := lastT TIMES (buffer.size+1)
              maxwork := (512*512)/stripsize
              workdone := 0
  :


  -- ----------------------------------------------------------------

  PROC gen.work (CHAN OF ANY in,out,init,report)
    --PROC gen.work creates new work as dictated by PROC work.valve. Work
    --is only sent to the network as results are returned. The one
    --exception to this is the initialization of the network for each
    --batch of work. At that time PROC display provides the initial
    --values for the parameters to be used by the network in doing its
    --calculations. PROC display sends this information to PROC gen.work
    --which then sends it to the network. When a reply is received by
    --PROC work.valve that the network has been initialized then a trigger
    --is sent to PROC gen.work to send a specific number of workpackets to
    --the network. As results are received by PROC work.valve, PROC
    --gen.work sends new workpackets.

    -- declarations

    #USE "\tdsiolib\fpmath8.tsr"

    #USE "raycom.tsr"

    INT32 seed, Tseed:
    INT x,y,lastT,buffer.size:
    INT address:
    INT maxrand:
    INT stripsize:
    INT randnum:
    INT window.width:
    INT maxwork,workdone:
    [128] BYTE pixels:
    BOOL active,trigger:

    PROC random (INT rnum, VAL INT rmax)
      --PROC random is a call to a library process which generates a
      --random number.
      REAL32 result:
      SEQ
        RANP (result, seed)
        rnum := INT ROUND (result * (REAL32 ROUND rmax))
    :


    SEQ
      active := TRUE
      seed := 245786 (INT32)
      WHILE active
        ALT
          in ? address
            IF
              workdone < maxwork
                SEQ
                  random (randnum,maxrand)
                  out ! c.ray;x;y;randnum;address
                  workdone := workdone + 1
                  x := x + stripsize
                  IF
                    x >= 512
                      SEQ
                        x := 0
                        y := y + 1
                    TRUE
                      SKIP
              TRUE
                SKIP
          init ? Tseed;window.width;maxrand;stripsize;lastT;buffer.size
            SEQ
              out ! c.init;Tseed;window.width;maxrand;stripsize;
                lastT;buffer.size
              maxwork := (512*512)/stripsize
              x := 0
              y := 0
              workdone := 0
          report ? trigger
            out ! c.report
  :


  PROC display (CHAN OF ANY in,out,to.graph,from.graph,init)
    --PROC display generates the initial data required by the network to
    --process the workpackets. It sends this data to PROC gen.work which
    --in turn sends the initialization data to the network. PROC display
    --also receives incoming results from PROC work.valve and sends them
    --to a B007 graphics controller which displays the results on a color
    --monitor.

    INT reply:
    INT64 compares:
    INT num,packets,total:
    INT maxwork,reports,time,buffer.size:
    INT x,y,command:
    INT address,j,k,l:
    INT x.pos,y.pos:
    INT mode,len:
    INT window.width,rmax,stripsize,lastT:
    [4] INT params:
    [128] BYTE pixels:
    BOOL m.l,m.m,m.r:
    BOOL active,not.done:




VAL seed 1S 1l2Zj4ee eer iis 


    PROC set.scroll (VAL INT top,bottom)
      SEQ
        mode := 0
        params[0] := VT220.set.scroll
        params[1] := top
        params[2] := bottom
        to.graph ! c.to.VT220;mode;params
    :

    PROC clear.screen ()
      SEQ
        mode := 0
        params[0] := VT220.clear.screen
        to.graph ! c.to.VT220;mode;params
    :

    PROC tab ()
      SEQ
        mode := 0
        params[0] := VT220.tab
        to.graph ! c.to.VT220;mode;params
    :

    PROC c.return ()
      SEQ
        mode := 0
        params[0] := VT220.return
        to.graph ! c.to.VT220;mode;params
    :

    PROC print.screen ()
      SEQ
        mode := 0
        params[0] := VT220.print.screen
        to.graph ! c.to.VT220;mode;params
    :

    PROC write.num (VAL INT num)
      SEQ
        mode := 1
        params[0] := VT220.num
        params[1] := num
        to.graph ! c.to.VT220;mode;params
    :


    PROC write.num.xy (VAL INT num,x,y)
      SEQ
        mode := 1
        params[0] := VT220.num.xy
        params[1] := num
        params[2] := x
        params[3] := y
        to.graph ! c.to.VT220;mode;params
    :

    PROC write.text (VAL []BYTE text)
      SEQ
        mode := 2
        params[0] := VT220.text
        len := SIZE text
        to.graph ! c.to.VT220;mode;params;len;text
    :


    PROC write.text.xy (VAL []BYTE text, VAL INT x,y)
      SEQ
        mode := 2
        params[0] := VT220.text.xy
        params[2] := x
        params[3] := y
        len := SIZE text
        to.graph ! c.to.VT220;mode;params;len;text
    :

    PROC highlight ()
      SEQ
        mode := 0
        params[0] := VT220.highlight
        to.graph ! c.to.VT220;mode;params
    :

    PROC underline ()
      SEQ
        mode := 0
        params[0] := VT220.underline
        to.graph ! c.to.VT220;mode;params
    :


      SEQ
        m.r := FALSE
        m.m := FALSE
        m.l := FALSE
        to.graph ! c.get.mouse
        from.graph ? x.pos;y.pos;m.l;m.m;m.r
        WHILE (NOT m.r) AND (NOT m.m) AND (NOT m.l)
          SEQ
            to.graph ! c.get.mouse
            from.graph ? x.pos;y.pos;m.l;m.m;m.r


    SEQ
      active := TRUE
      window.width := 1000
      lastT := 8
      rmax := 1000
      stripsize := 128
      address := 0
      j := 0
      k := 0
      l := 0
      WHILE (stripsize >= 1)
        SEQ
          SEQ buffer.size = 1 FOR 20
            SEQ
              -- color monitor
              to.graph ! c.hide.cursor
              from.graph ? reply
              to.graph | Colne cri sony.
              from.graph ? reply
              to.graph ! c.select.screen;0
              from.graph ? reply
              to.graph ! c.clear.screen;0
              from.graph ? reply
              to.graph ! c.display.screen;0
              from.graph ? reply
              to.graph ! c.select.colour.table;0
              from.graph ? reply
              oe ee)


              clear.screen ()
              set.scroll (is, 4)
              highlight ()
              IF
                address = 999
                  write.text.xy ("Work Farm - Linear/no address mode",1,20)
                TRUE
                  write.text.xy ("Work Farm - Linear/address mode",1,20)
              highlight ()

              to.graph ! c.to.host;window.width
              to.graph ! c.to.host;stripsize
              to.graph ! c.to.host;buffer.size

              write.text.xy ("Window width",3,1)
              write.num.xy (window.width,3,12)
              write.text.xy ("# Transputers: ",3,50)
              write.num.xy (lastT,3,65)
              write.text.xy ("Random number range 0 - ",4,1)
              write.num.xy (rmax,4,25)
              write.text.xy ("Stripsize: ",4,50)
              write.num.xy (stripsize,4,61)
              write.text.xy ("Buffer size: ",5,50)
              write.num.xy (buffer.size,5,63)
              c.return ()
              underline ()
              write.text ("Transputer # Work Done Compares")
              underline ()
              c.return ()
              c.return ()


              out ! stripsize;lastT;address;buffer.size
              init ! seed;window.width;rmax;stripsize;lastT;buffer.size
              reports := 0
              not.done := TRUE
              maxwork := 0
              WHILE not.done
                SEQ
                  in ? command
                  IF
                    command = c.result
                      SEQ
                        in ? x;y;[pixels FROM 0 FOR stripsize]
                        to.graph ! c.mandelbrot;stripsize;x;y;
                          [pixels FROM 0 FOR stripsize]
                        maxwork := maxwork + 1
                    command = c.report.data
                      SEQ
                        in ? packets;compares;num
                        reports := reports + 1
                        to.graph ! c.to.host;num
                        to.graph ! c.to.host;packets
                        to.graph ! c.to.host;(INT compares)
                        write.text (" ")
                        write.num (num)
                        tab ()
                        IF
                          num < 999
                            tab ()
                          TRUE
                            SKIP
                        write.num (packets)
                        tab ()
                        tab ()
                        write.num (INT compares)
                        c.return ()
                    command = c.time
                      in ? time
                    command = c.test
                      SEQ
                        in ? command
                        write.text ("Test ")
                        write.num (command)
                        c.return ()
                    TRUE
                      SKIP
                  IF
                    reports = lastT
                      SEQ
                        to.graph ! c.to.host;maxwork
                        to.graph ! c.to.host;time
                        c.return ()
                        write.text ("Total work: ")
                        write.num (maxwork)
                        c.return ()
                        write.text ("Time (usec): ")
                        write.num (time)
                        c.return ()
                        not.done := FALSE
                    TRUE
                      SKIP
          stripsize := stripsize/2
  :


  PROC stop.watch (CHAN OF ANY in,out)
    --PROC stop.watch provides timing information from the high priority
    --clock running in 1 usec ticks. PROC stop.watch is called in this
    --algorithm by PROC work.valve

    INT start, finish:
    TIMER clock:
    BOOL active,toggle:

    SEQ
      active := TRUE
      WHILE active
        PRI PAR
          SEQ
            in ? toggle
            clock ? start
            in ? toggle
            clock ? finish
            out ! (finish-start)
          SKIP
  :

  PRI PAR
    PAR
      work.valve (results,new.work,to.time,from.time,
                  to.display,from.display,report)
      gen.work (new.work,work,init,report)
      display (to.display,from.display,to.graph,from.graph,init)
      stop.watch (to.time,from.time)
    SKIP
:


-- ------------------------------------------------------------------

PROC pixel.gen (CHAN OF ANY work.in,work.out,results.out,results.in,
                VAL INT mynum)
  --PROC pixel.gen is run by each worker in the workfarm. It contains
  --processes which control the routing, processing and buffering of
  --workpackets and processes which control the flow of results back up to
  --the controller.

  #USE "raycom.tsr" -- ray tracer command definitions
  -- declarations
  CHAN OF ANY to.ray.tr,from.ray.tr,requestmore: -- internal channels


  PROC router (CHAN OF ANY work.in,work.out,to.ray.tr,requestmore,
               VAL INT mynum)
    --PROC router controls the flow of workpackets. If a given worker in
    --the network cannot handle the current workpacket it is either
    --buffered or sent to another worker in the network.

    -- declarations
    #USE "raycom.tsr" -- ray tracer command definitions
    BOOL active:
    BOOL busy:
    BOOL bufferfull:
    VAL buffmax IS 20:
    VAL empty foe:
    [buffmax] INT xtemp:
    [buffmax] INT ytemp:
    [buffmax] INT rantemp:
    [buffmax] INT addresstemp:
    INT address,buffer.size:
    INT command,lastT:
    INT32 seed,myseed:
    INT window.width,rmax,stripsize:
    INT randnum:
    INT packets.done:
    INT x,y:
    INT buffcount:
    BYTE sendmore:


    PROC store (INT x,y,randnum,address)
      --PROC store is the buffer for workpackets in any given worker. If
      --the worker is not busy and PROC store has available workpackets,
      --one will be sent by PROC send to PROC gen.pixel
      SEQ
        buffcount := buffcount + 1
        IF
          (buffcount = (buffer.size-1)) OR (buffcount = (buffmax-1))
            SEQ
              bufferfull := TRUE
              xtemp[buffcount] := x
              ytemp[buffcount] := y
              rantemp[buffcount] := randnum
              addresstemp[buffcount] := address
          TRUE
            SEQ
              xtemp[buffcount] := x
              ytemp[buffcount] := y
              rantemp[buffcount] := randnum
              addresstemp[buffcount] := address
    :

    PROC send ()
      --PROC send removes workpackets from the workpacket buffer and
      --routes them to PROC gen.pixel if PROC gen.pixel is not busy and
      --the buffer has at least one workpacket.
      SEQ
        to.ray.tr ! c.ray;xtemp[buffcount];ytemp[buffcount];
          rantemp[buffcount];addresstemp[buffcount]
        bufferfull := FALSE
        packets.done := packets.done + 1
        IF
          buffcount > empty
            buffcount := buffcount - 1
          TRUE
            SKIP
    :
    SEQ
      active := TRUE
      WHILE active
        PRI ALT
          busy & requestmore ? sendmore
            IF
              buffcount = empty
                busy := FALSE
              TRUE
                send ()
          --(NOT busy) & requestmore ? sendmore
          ates 9) Ss 2
          work.in ? command
            IF
              command = c.ray
                SEQ
                  work.in ? x;y;randnum;address
                  IF
                    address = 999
                      IF
                        NOT busy
                          SEQ
                            to.ray.tr ! c.ray;x;y;randnum;address
                            packets.done := packets.done + 1
                            busy := TRUE
                        bufferfull
                          work.out ! c.ray;x;y;randnum;address
                        TRUE
                          store (x,y,randnum,address)
                    address = mynum
                      IF
                        NOT busy
                          SEQ
                            to.ray.tr ! c.ray;x;y;randnum;address
                            packets.done := packets.done + 1
                            busy := TRUE
                        TRUE
                          store (x,y,randnum,address)
                    TRUE
                      work.out ! c.ray;x;y;randnum;address
              command = c.init
                SEQ
                  -- initialization sequence
                  work.in ? seed;window.width;rmax;
                    stripsize;lastT;buffer.size
                  myseed := seed
                  seed := seed + 1231 (INT32)
                  work.out ! c.init;seed;window.width;rmax;stripsize;
                    lastT;buffer.size
                  to.ray.tr ! c.init;myseed;window.width;rmax;stripsize
                  packets.done := 0
                  bufferfull := FALSE
                  buffcount := empty
                  busy := FALSE
              command = c.report
                SEQ
                  work.out ! c.report
                  to.ray.tr ! c.report.data;packets.done
              command = c.test
                SEQ
                  work.in ? command
                  work.out ! c.test;command
              TRUE
                SKIP
  :


  PROC bypass (CHAN OF ANY results.in,results.out,from.ray.tr)
    --PROC bypass receives results from PROC gen.pixel or from other
    --workers downstream trying to send their results up to the
    --controller. It provides a control point for the flow of results
    --to the controller.

    -- declarations
    #USE "raycom.tsr" -- ray tracer command definitions
    CHAN OF ANY mine,theirs: -- internal channels

    PROC my.buffer (CHAN OF ANY in,out)
      --PROC my.buffer receives results from PROC gen.pixel within that
      --specific worker and forwards that data on to PROC mix for
      --transmission to the controller
      -- declarations
      #USE "raycom.tsr" -- ray tracer command definitions
      INT command, address:
      INT x, y:
      INT32 seed:
      INT mynum, packets.done:
      INT64 compares:
      INT randnum, window.width, rmax, stripsize:
      [128] BYTE pixels:
      BOOL active:


      SEQ
        active := TRUE
        WHILE active
          SEQ
            in ? command
            IF
              command = c.result
                SEQ
                  in ? x;y;[pixels FROM 0 FOR stripsize];address
                  out ! c.result;x;y;
                    [pixels FROM 0 FOR stripsize];address
              command = c.report.data
                SEQ
                  in ? packets.done;compares;mynum
                  out ! c.report.data;packets.done;compares;mynum
              command = c.init
                in ? stripsize
              command = c.test
                SEQ
                  in ? command
                  out ! c.test;command
              TRUE
                SKIP
    :


    PROC buffer (CHAN OF ANY in,out)
      --PROC buffer receives results from downstream workers and forwards
      --the results to PROC mix for transmission to the controller
      -- declarations
      #USE "raycom.tsr" -- ray tracer command definitions
      INT command, lastT, buffer.size:
      INT x,y,address:
      INT32 seed:
      INT mynum, packets.done:
      INT64 compares:
      INT randnum, window.width, rmax, stripsize:
      [128] BYTE pixels:
      BOOL active:

      SEQ
        active := TRUE
        WHILE active
          SEQ
            in ? command
            IF
              command = c.result
                SEQ
                  in ? x;y;[pixels FROM 0 FOR stripsize];address
                  out ! c.result;x;y;
                    [pixels FROM 0 FOR stripsize];address
              command = c.report.data
                SEQ
                  in ? packets.done;compares;mynum
                  out ! c.report.data;packets.done;compares;mynum
              command = c.report
                out ! c.report
              command = c.init
                SEQ
                  in ? seed;window.width;rmax;
                    stripsize;lastT;buffer.size
                  out ! c.init;seed;window.width;rmax;stripsize;
                    lastT;buffer.size
              command = c.test
                SEQ
                  in ? command
                  out ! c.test;command
              TRUE
                SKIP
    :


    PROC mix (CHAN OF ANY mine,theirs,out)
      --PROC mix acts as the control point for sending results from this
      --worker or from a downstream worker depending on whether the active
      --input channel is from PROC my.buffer or PROC buffer respectively.

      -- declarations
      #USE "raycom.tsr" -- ray tracer command definitions
      INT command, lastT, buffer.size:
      INT x,y,address:
      INT32 seed:
      INT mynum, packets.done:
      INT64 compares:
      INT randnum, window.width, rmax, stripsize:
      [128] BYTE pixels:
      BOOL active:

      SEQ
        active := TRUE
        WHILE active
          ALT
            mine ? command
              IF
                command = c.result
                  SEQ
                    mine ? x;y;[pixels FROM 0 FOR stripsize];address
                    out ! c.result;x;y;
                      [pixels FROM 0 FOR stripsize];address
                command = c.init
                  SKIP
                command = c.test
                  SEQ
                    mine ? command
                    out ! c.test;command
                command = c.report.data
                  SEQ
                    mine ? packets.done;compares;mynum
                    out ! c.report.data;packets.done;compares;mynum
                TRUE
                  SKIP
            theirs ? command
              IF
                command = c.result
                  SEQ
                    theirs ? x;y;[pixels FROM 0 FOR stripsize];address
                    out ! c.result;x;y;
                      [pixels FROM 0 FOR stripsize];address
                command = c.init
                  SEQ
                    theirs ? seed;window.width;rmax;
                      stripsize;lastT;buffer.size
                    out ! c.init;seed;window.width;rmax;stripsize;
                      lastT;buffer.size
                command = c.test
                  SEQ
                    theirs ? command
                    out ! c.test;command
                command = c.report
                  out ! c.report
                command = c.report.data
                  SEQ
                    theirs ? packets.done;compares;mynum
                    out ! c.report.data;packets.done;compares;mynum
                TRUE
                  SKIP
    :


PAR 
my.buffer (from.ray.tr,mine) 
buffer (results.in,theirs) 
mix (mine,theirs, results.out) 


  PROC gen.pixel (CHAN OF ANY in,requestmore,out,VAL INT mynum)
    --PROC gen.pixel processes the workpackets by generating random
    --numbers until one of them falls within a specific tolerance
    --(window.width) of a random number generated by PROC gen.work in the
    --controller process. The results are sent to the local PROC
    --my.buffer and then to PROC mix for transmission to the controller
    --either directly or via an intermediate worker.

    -- declarations
    #USE "raycom.tsr" -- ray tracer command definitions
    #USE "\tdsiolib\fpmath8.tsr"
    BOOL active,closenough:
    INT32 seed:
    INT command, address:
    INT x, y:
    INT packets.done:
    INT64 compares:
    INT window.width:
    INT randnum:
    INT stripsize:
    INT rnum, rmax:
    INT temp:
    [128] BYTE pixels:
    VAL sendmore IS 0 (BYTE):

    PROC random (INT rnum,VAL INT rmax)
      --PROC random is a call to a library process which generates a
      --random number.
      REAL32 result:
      SEQ
        RANP (result, seed)
        rnum := INT ROUND (result * (REAL32 ROUND rmax))
    :


    SEQ
      active := TRUE
      WHILE active
        SEQ
          in ? command
          IF
            command = c.ray
              SEQ
                in ? x;y;randnum;address
                SEQ i = 0 FOR stripsize
                  SEQ
                    random (rnum,255)
                    pixels[i] := BYTE rnum
                random (rnum,rmax)
                closenough := FALSE
                WHILE NOT closenough
                  SEQ
                    temp := rnum - randnum
                    IF
                      temp < 0
                        temp := temp TIMES (-1)
                      TRUE
                        SKIP
                    compares := compares + 1 (INT64)
                    IF
                      window.width >= temp
                        closenough := TRUE
                      TRUE
                        random (rnum,rmax)
                out ! c.result;x;y;[pixels FROM 0 FOR stripsize];address
                requestmore ! sendmore
            command = c.init
              SEQ
                in ? seed;window.width;rmax;stripsize
                out ! c.init;stripsize
                compares := 0 (INT64)
            command = c.report.data
              SEQ
                in ? packets.done
                out ! c.report.data;packets.done;compares;mynum
            command = c.test
              SEQ
                in ? command
                out ! c.test;command
            TRUE
              SKIP
  :
  PRI PAR
    -- high priority
    PAR
      router (work.in,work.out,to.ray.tr,requestmore,mynum)
      bypass (results.in,results.out,from.ray.tr)
    -- low priority
    gen.pixel (to.ray.tr,requestmore,from.ray.tr,mynum)
:


PROC graph (CHAN OF ANY in, out, from.mouse, to.mouse)
  #USE "\misclib\grafproc.tsr"
  #USE "\misclib\grafsymb.tsr"
  graphics (in, out, from.mouse, to.mouse)
:

PROC VT220mouse (CHAN OF ANY from.graph, to.graph, from.net, to.net)
  #USE “\miSelib\Veermmetss”
  #USE "\misclib\grafsymb.tsr"
  #USE "\misclib\mouse.tsr"
  mouse (from.graph,to.graph,from.net,to.net)
:


PLACED PAR

  PROCESSOR 2 T4 -- mouse/terminal process
    PLACE mouse.to.graph AT link0out:
    PLACE graph.to.mouse AT link0in:
    PLACE from.net AT link2out:
    PLACE to.net AT link2in:
    VT220mouse (graph.to.mouse,mouse.to.graph,from.net,to.net)

  PROCESSOR 1 T4 -- graphics process
    PLACE to.graph AT link3in:
    PLACE from.graph AT link3out:
    PLACE mouse.to.graph AT link0in:
    PLACE graph.to.mouse AT link0out:
    graph (to.graph,from.graph,mouse.to.graph,graph.to.mouse)

  PROCESSOR 10 T8 -- controller process
    PLACE to.graph AT link1out:
    PLACE from.graph AT link1in:
    PLACE work[0] AT link0out:
    PLACE results[0] AT link0in:
    controller (to.graph,from.graph,work[0],results[0])

  PLACED PAR i = 0 FOR numT8-1 -- T800 ray tracers
    PROCESSOR ((i+1)*100) T8
      PLACE work[i] AT work.in[i]:
      PLACE work[i+1] AT work.out[i]:
      PLACE results[i] AT results.out[i]:
      PLACE results[i+1] AT results.in[i]:
      pixel.gen (work[i],work[i+1],results[i],results[i+1],((i+1)*100))

  PROCESSOR ((numT8)*100) T8
    PLACE work[numT8-1] AT work.in[numT8-1]:
    PLACE results[numT8-1] AT results.out[numT8-1]:
    pixel.gen (work[numT8-1],endloopT8,results[numT8-1],
               endloopT8,((numT8)*100))
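
For readers unfamiliar with occam, the comparison loop that PROC gen.pixel runs for each workpacket can be paraphrased in Python as below. This is a sketch under stated assumptions, not a translation of the whole process: the function name `process_workpacket` is hypothetical, Python's `random.Random` stands in for the RANP library call, and the pixel-filling step is omitted.

```python
import random


def process_workpacket(target, window_width, rmax, rng):
    """Draw integers in [0, rmax] until one lies within window_width of
    target; return the accepted draw and the number of comparisons."""
    compares = 0
    while True:
        candidate = round(rng.random() * rmax)
        compares += 1
        if abs(candidate - target) <= window_width:
            return candidate, compares


rng = random.Random(245786)  # any fixed seed gives a repeatable run
print(process_workpacket(target=500, window_width=1000, rmax=1000, rng=rng))
print(process_workpacket(target=500, window_width=10, rmax=1000, rng=rng))
```

With the settings assigned in PROC display (window.width and rmax both 1000), every first draw is accepted; narrowing the window makes the number of comparisons, and hence the processing time per packet, irregular.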

